DOCUMENT RESUME

ED 469 376                                                  TM 034 526

AUTHOR        Foshay, Rob
TITLE         Choosing the Right Testing Option in PLATO Courseware. PLATO
              Technical Paper.
INSTITUTION   PLATO Learning, Inc., Bloomington, MN.
REPORT NO     PLATO-TP-13
PUB DATE      2002-08-00
NOTE          60p.
PUB TYPE      Reports - Descriptive (141)
EDRS PRICE    EDRS Price MF01/PC03 Plus Postage.
DESCRIPTORS   Achievement Tests; Computer Software; Curriculum;
              Mathematics; Needs Assessment; *Reading; *Selection; Student
              Placement; Test Coaching; *Test Use; *Writing (Composition)

ABSTRACT

There are a number of issues to consider in choosing tests,
including alignment of tests and standards, the integration of tests with
curriculum and instruction, the quality of the tests, and a clear definition
of the purpose of the test. To address the various needs reflected by these
issues, PLATO Learning, Inc., offers two curriculum-wide testing systems:
NetSchools "Orion" GATE and PLATO (registered) LINK. For practice in
preparation for high-stakes tests, PLATO Learning offers the Simulated Tests
in mathematics, reading, and writing. To support needs for placement,
progress control, and cumulative testing when using PLATO (registered)
courseware, PLATO Learning offers the FASTRACK and Skills Inventory systems,
module mastery tests, and course-level assessments. Each of these systems has
different characteristics and is designed to serve different needs. Choosing
among them involves answering 12 key questions about testing needs, which are
provided in this document. (Author/SLD)



Technical Paper #13 



August, 2002 



Choosing the Right Testing 
Option in PLATO Courseware 






Rob Foshay, Ph.D. 
Vice President 






Instructional Design and Cognitive Learning 



PLATO Learning, Inc. 

10801 Nesbitt Avenue South
Bloomington, MN 55437 
(800) 869-2000 
http://www.plato.com 
author’s e-mail: 
rfoshay@plato.com 






COPYRIGHT © PLATO LEARNING, INC., 2000. ALL RIGHTS RESERVED. MAY BE DUPLICATED AND DISTRIBUTED, WITH CREDIT TO PLATO 
LEARNING, INC. 







ABSTRACT 



Any effective educational enterprise must include measurement of its learning 
outcomes, before, during and after instruction. Standards and accountability 
requirements place even more emphasis on assessment. But there are a number of 
issues to consider when choosing tests. The main ones include:

• Alignment of tests and standards 

• Integration of tests with curriculum and instruction 

• Quality of tests, including validity issues such as detailed alignment with 
standards, as well as reliability 

• Clear definition of the purpose of the test, including high vs. low stakes, 
two types of placement testing needs, progress control, and two types of 
cumulative post-tests. 

To address the various needs emerging from these issues, PLATO Learning, Inc.
offers two curriculum-wide testing systems: NetSchools Orion GATE and
PLATO® LINK. For practice in preparation for high-stakes tests, PLATO
Learning offers the Simulated Tests in math, reading and writing. To support 
needs for placement, progress control and cumulative testing when using 
PLATO® courseware, PLATO Learning offers the FASTRACK and Skills 
Inventory systems, module mastery tests, and course-level assessments. Each of 
these systems has different characteristics and is designed to serve different needs. 
Choosing among them involves answering 12 key questions about your testing 
needs. 

Introduction



Any effective educational enterprise must include measurement of its learning 
outcomes, before, during and after instruction. Standards and accountability 
requirements place even more emphasis on assessment. One result is that an 
increasing number of tests have been layered onto the curriculum, often without 
enough attention to how well they reflect standards or how well they measure. 
Furthermore, data obtained from the tests are often so general and so slow in 
coming that they are of little use to teachers and administrators who need to make 
decisions about individual learners, classes and schools. The net result can be 
highly misleading information about program effectiveness: the wrong test, 
reported at the wrong time, gives disinformation to educators, policy makers and 
the community. 

In response to this problem, PLATO Learning, Inc. has recently expanded the 
capabilities of the PLATO® family of technologies for testing. Our goal is to 
provide educators at all levels with high-quality, standards-aligned, competency- 
based, online testing systems, with immediate online reporting, for the full range 
of low-stakes testing needs. The testing systems support both the PLATO® 
instructional systems and the full core curriculum. Many of the testing systems 
are customizable, and some may be used independently of the PLATO courseware 
if desired. 

This technical paper will first review key issues in testing and discuss their 
relationship to effective implementation of standards under No Child Left Behind. 
Then it will provide an overview of each testing capability provided by the 
PLATO® technologies. Finally, a guide will help you choose among the testing
options PLATO Learning provides.

Part




Using Tests to Support Effective Standards Implementation 



In this part, we’ll first discuss five common issues in standards implementation 
which surround use of tests. Then we’ll discuss the types of tests that are needed 
to implement standards. We will then briefly summarize the testing requirements 
of No Child Left Behind. Finally, we’ll discuss issues of reliability and validity as 
they affect the types of tests needed for standards implementation, and provide an 
overview of validity and reliability procedures used for PLATO tests. 



5 Common Testing Issues in Standards Reform 



Tests are by far the most common means of assessment in education, but their 
very familiarity often leads educators to overlook issues in test design that take on 
particular significance when implementing a standards-based system for 
accountability. These issues concern: 

• Confusion over test types 

• Disconnected curricula and tests 

• Technically poor tests 

• “The tail wags the dog” syndrome 

• Testing overload 

We’ll discuss each of these in turn. 

Confusion over test types 

Educators are most familiar with norm-referenced tests (often called “standardized 
tests”), yet adequate measurement of standards requires criterion-referenced tests. 
The distinction between the two is not widely understood. 

A norm-referenced test is designed to compare achievement of each learner to a 
reference group, such as a national sample of students at the same grade level. 
Questions on a norm-referenced test are chosen because they are of moderate 
difficulty for students at that grade level: if a question is too hard or too easy, it is 
eliminated because it doesn’t do a good job of classifying students. The content of 
a norm-referenced test is determined by a domain specification which carefully 
defines the boundaries of the content area to be measured. For example, grade 
levels in reading are norm-referenced. “Grading on the curve” is a norm-referenced
practice. Tests such as the Iowa Tests of Basic Skills and the Stanford
Achievement Test are norm-referenced. 

A criterion-referenced test is designed to map how well learners can perform (or 
understand) a particular benchmark for a standard. The difficulty of the questions 
included is determined by the benchmarks to which they correspond: each
question will be only as easy or hard as is needed to properly measure the
benchmark. Students who have fully mastered the standards should find the test 
easy. In the criterion-referenced world of standards, if the system is working, 
everyone should get an “A.” Even better than a letter grade, however, is a 
checklist showing which benchmarks they have (and have not) attained - in effect, 
a separate “pass/fail” decision on each benchmark. 

The content of a criterion-referenced test is usually competency-based, meaning 
that it takes as its content map the standards and benchmarks to be measured. 
Rather than mapping the “boundary” of the content domain as a norm-referenced 
test does, a criterion-referenced test maps the whole of the content area. State
standards tests are usually competency-based and criterion-referenced, and Federal
policy under No Child Left Behind encourages states now using norm-referenced
tests to transition to competency-based ones (Marzano and Mid-Continent
Regional Educational Lab., Aurora, CO, 1998).

You can’t simply look at a test and tell whether it is norm-referenced or 
competency-based. The question types on the two tests usually are the same, and 
the items are scored in the same way. There might be differences in details of 
what is tested or in item difficulty which wouldn’t be apparent on a casual 
inspection. The basic distinction really is with the interpretation of the score: a
norm-referenced test classifies learners relative to other learners; a
competency-based test makes “yes or no” decisions on whether the learner has
mastered each competency tested (Linn, National Council on Measurement in
Education, et al. 1993).
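
To make the distinction concrete, here is a minimal sketch in Python of the two ways of reading the same item scores. It is an illustration only, not part of any PLATO system: the function names, the sample data, and the 0.8 mastery cutoff are all assumptions made for this example.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Norm-referenced reading: percent of the reference group scoring below `score`."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # count of scores strictly below
    return 100.0 * below / len(ordered)


def mastery_checklist(item_results, cutoff=0.8):
    """Criterion-referenced reading: one mastered/not-mastered decision per benchmark.

    item_results maps a benchmark name to a list of 1/0 item scores.
    The 0.8 cutoff is an illustrative assumption, not a PLATO value.
    """
    return {
        benchmark: (sum(items) / len(items)) >= cutoff
        for benchmark, items in item_results.items()
    }


# Hypothetical data: one learner, ten items covering two benchmarks.
learner_items = {
    "add fractions": [1, 1, 1, 1, 1],
    "multiply fractions": [1, 0, 0, 1, 0],
}
total = sum(sum(items) for items in learner_items.values())  # 7 of 10 correct
norm_group = [3, 4, 5, 5, 6, 6, 7, 8, 9, 10]  # hypothetical reference-group totals

print(percentile_rank(total, norm_group))   # 60.0
print(mastery_checklist(learner_items))
# {'add fractions': True, 'multiply fractions': False}
```

The same ten answers yield a single ranking in one reading and a benchmark-by-benchmark checklist in the other; only the checklist tells a teacher what to reteach.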

Disconnected curricula and tests 

Truly implementing state curriculum standards in the daily practice of the 
classroom is a tremendous challenge. Most educators confront a disconnected 
curriculum structure: National standards (such as NCTM) and tests (such as the
National Assessment of Educational Progress, or the Scholastic Aptitude Test)
show major discrepancies with each other and with the various state standards.
Often, the state standards and tests do not align well with each other (Marzano and
Kendall 1996). Furthermore, the standards and benchmarks themselves are often 
of poor technical quality (Kendall 2001), and educators find them difficult to 
interpret at the level of detail needed for daily lesson planning and testing. The 
sheer volume of standards is itself an issue. National and state content standards 
would require students to master one benchmark per day in every subject - an 
unrealistic goal (Marzano and Kendall 1996), so schools must choose which
standards and benchmarks to implement.

In this climate, it is scarcely surprising that simply figuring out what to teach and 
test is extremely difficult. Most teachers, confronted with a thick binder of 
standards which may suffer from these weaknesses, find it impossible to actively 
use them as a guide to daily teaching and testing. The result is that what happens 
in the classroom often has only an indirect relationship to standards. 

Generations of educational practice often lead well-meaning professionals to focus 
on the wrong things. The essence of the standards movement is to define success 
in terms of learning outcomes (what the students can do) rather than delivery of 
instruction (what the teachers do). Yet, many standards documents focus on 
content rather than performance. Furthermore, teachers often develop a repertoire 
of activities which seem to work well with their learners, and they may be 
reluctant to change, even if the activities have little relationship to curriculum 
standards. When teachers develop their tests based on these activities, the 
influence of standards is lost. 

Administrative practices also reinforce the disconnections. Schools usually 
standardize instruction, rather than learning: everyone gets the same number of 
“contact hours” in each subject (leading to the familiar “bell-shaped curve” of 
achievement), rather than creating a system which does whatever it takes so 
everyone reaches the same learning outcomes (leading to a “bell-shaped curve” of
instructional time). Even the familiar Carnegie Unit is defined in this way. This
virtually ensures that mastery of standards by all learners will not occur. Grading
systems are based on the norm-referenced idea of “grading on the curve” rather 
than the competency-based framework of standards. Even when tests are 
standards-referenced, there are often delays of months in reporting results, and the 
reports often lack the detail needed to plan interventions with particular students. 

The result is a system in which the teaching practices, the content, the class
time, the teaching, the administrative practices and the tests are all disconnected 
from each other and from curriculum standards. 

Technically Poor Tests 

Researchers have shown that many state tests are of poor technical quality 
(Marzano and Kendall 1996). They often have only a modest correspondence to 
the standards they are intended to test, and they may not have gone through a 
sufficiently rigorous item development process to justify the tests’ use in high- 
stakes situations such as determining eligibility to graduate. Some states have 
adopted norm-referenced standardized tests, rather than developing their own. 
However, a norm-referenced test is not an adequate measure in the competency- 
based world of standards, and Federal policy under No Child Left Behind is to 
move toward competency-based tests. 

District and classroom tests typically are for low-stakes purposes, and the
development cost and effort can be lower than for the high-stakes state tests. 
However, teacher-made tests are of notoriously uneven quality, often with 
overemphasis on questions which are low on Bloom’s Taxonomy of the Cognitive 
Domain, and which exhibit many stylistic flaws which depress test quality. Since, 
as argued above, these tests are often disconnected from standards as well, the 
result is that the “first line” of measurement of learning in the classroom is often
highly misleading. It’s little wonder that administrators are vulnerable to “nasty 
surprises” when students who have been testing well all year on the local measures 
suddenly have difficulty on the high-stakes state tests. 

It is also clear that not all benchmarks can be adequately measured with multiple- 
choice questions (Marzano and Kendall 1996). Educators must be clear about 
what can - and cannot - be measured by multiple-choice tests, and a complete 
measurement of standards must include the more labor-intensive performance- 
based measures in a portfolio evaluation approach, even though reliability of these 
measures is often lower than a well-designed multiple-choice test. 

“The Tail Wags the Dog” Syndrome

The problems with disconnected curricula and poor tests lead to a further 
difficulty: “teaching to the test” - the wrong way. The problem starts with the 
principle that since it’s impractical to test for all the learning outcomes referenced 
in the standards, the state tests must use a domain sampling procedure. In other 
words, each test is a random sample of just a few of the many questions which 
correspond to the standards. In Fig. 1, the domain (described by the
standards) is represented by the circle on the left, and the random sample of items
by the squares within. These are then assembled into a test, as shown by the
column of squares on the right.

The problem comes when teachers, seeking guidance on what to teach, look at the 
test (the domain sample) rather than the curriculum standards (the domain). They 
conclude that “this must be what they mean” by the standards, and follow the test. 
However, since the test is only a domain sample, the disconnect with the standards 
is still a problem. Furthermore, when a new form of the test is used (perhaps in 
the next semester), it has a different domain sample, and the teacher will complain 
that “they changed the test,” thus betraying both teacher and students. 

• Fig. 1: Teaching to the test



“Teaching to the test” would be acceptable - if the test were a perfect and 
complete measure of the standards (the domain). But most tests are domain 
samples, and no test is a perfect measure of its standards. Therefore, the goal 
should be to teach to the standards, not to the test. 
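
The effect of domain sampling can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any state test is built: the 200-item domain, the 20-item form length, and the function name are hypothetical.

```python
import random


def draw_test_form(domain_items, n_items, seed):
    """Draw one test form as a simple random sample of the item domain."""
    rng = random.Random(seed)  # seeded so each form is reproducible
    return sorted(rng.sample(domain_items, n_items))


# Hypothetical domain: 200 item IDs that together cover the standards.
domain = list(range(200))

form_a = draw_test_form(domain, 20, seed=1)
form_b = draw_test_form(domain, 20, seed=2)

overlap = set(form_a) & set(form_b)
coverage = len(set(form_a)) / len(domain)
print(f"Form A covers {coverage:.0%} of the domain")
print(f"Items shared by forms A and B: {len(overlap)} of {len(form_a)}")
```

A teacher who teaches only the items on Form A has covered a tenth of this hypothetical domain, and has little assurance that those items will reappear on Form B.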



Testing Overload

Standards and accountability reforms have often been treated as additive, rather
than displacing old policies. One result of this trend has been that a typical
curriculum has many “layers” of testing, often including national tests such as the
NAEP, SAT, ACT, state standards tests, norm-referenced tests used by the district
as pre- and post-tests, district-mandated final exams, various diagnostic and
placement tests, departmental unit tests, and each teacher’s own unit tests and
quizzes. It’s little wonder that teachers complain that so much time is occupied by
testing that there’s scant room left for teaching. The problem is exacerbated by the
disconnections between the tests, as discussed above. It’s nearly impossible for a
teacher to figure out which tests are providing data which actually relates to
standards.


Fixing a “no-win” situation

The issues of test quality, disconnects with the curriculum, the “tail wagging the
dog,” and test overload often combine to produce a no-win situation for teachers.
Faced with uninterpretable standards, misaligned tests of variable quality, and the
imperative to improve test scores, teachers often become frustrated and even
cynical about testing in general: they feel they are in a no-win situation.

Fixing the problem is at the core of successful standards implementation:

• Schools should insist that their state standards be technically sound, with 
clear definitions of domains, benchmarks and performance levels — in 
numbers realistic to teach in the time available. Where this is not the case, 
schools will need to develop their own “interpretation” of their state’s 
standards. 

• Schools and state policy makers should make sure all assessments (high-
and low-stakes) are aligned to their standards and benchmarks - and
eliminate those tests which are not. 

• Schools should define what to teach in terms of competency-based 
benchmarks and objectives, not in terms of tests. 

• Leaders should focus attention on student attainment of the standards, not 
simply on quality of teaching. 

• Leaders should make sure high- and low-stakes assessments are of the
appropriate types: competency-based and criterion-referenced.

• Within the practical limits of assessment length and cost, leaders should 
make sure all tests are of appropriate quality for their purpose. 

• Schools should arrange for timely and detailed (disaggregated) reports 
from all tests, so they can be used for data-based decision-making. 

Types of Tests



Five types of tests are commonly used in schools. They differ by purpose, 
content, and quality/cost/length requirements. The test types are shown in Fig. 2. 
We’ll discuss each type separately.

• Fig. 2: Types of Tests (figure labels: Placement, Admission, Progress Control,
Accountability, Accreditation)



Lesson Quiz/Mastery Tests 

The most familiar kind of test, these tests are usually given in close proximity to 
instruction - often at the end of the lesson (as is the case with PLATO 
courseware). The purpose of the test is to provide the learners with quick 
feedback on how well they understand what was just studied, and to provide the 
instructor with rapid information which can be used to decide what to do next:
reteach, or go on to the next lesson. In mastery learning, these tests are used to 
regulate progress through the learning sequence. 

Content of the test should reflect the lesson’s terminal objective(s). This close 
linking of instruction and test is especially important here, because the tests should 
be used for data-based decision-making to guide each individual learner. 

In general, in individualized instruction, it is better to give many short “testlets,” 
tightly linked to instruction, rather than a few longer tests. The “testlets” can be 
short quizzes, which only focus on a single terminal objective, and which are very 
short (research on test length suggests that 8 questions per objective provides the 
best reliability). This allows for immediate identification of learning problems and 
immediate intervention. 

It is important for these tests to be a highly valid measure of what is taught in the 
instruction. If the instruction is in turn defined by standards, then the lesson quiz 
will be a valid measure of the corresponding benchmark or objective. 

This is a low-stakes test, since the only consequence of measurement error would 
be to waste a small amount of the learner’s time (by reteaching or by going on 
inappropriately). Therefore, reliability can be moderate, and the test cost and test 
length can be modest. 
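
The trade-off between test length and reliability can be illustrated with the Spearman-Brown formula from classical test theory. The formula is not cited in this paper and is offered here only as a hedged sketch; it assumes that any added questions are comparable in quality to the existing ones, and the 4-item quiz with reliability 0.60 is a hypothetical starting point.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    Assumes the added items are parallel to (as good as) the existing items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


base = 0.60  # hypothetical reliability of a 4-item quiz
for factor in (1, 2, 4):
    print(f"{4 * factor:2d} items: {spearman_brown(base, factor):.2f}")
```

Under these assumptions, each doubling of length buys a smaller gain in reliability than the last, which is why a low-stakes quiz can reasonably stop at a modest length.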

While these tests are commonly graded by percentage and used as a basis for letter 
grades, in the competency-based, criterion-referenced world of standards, it makes 
more sense for these tests to be graded as “pass/fail” (actually “mastered/not 
mastered”) as evidence of mastery of the corresponding objective or benchmark. 

Pretest for need 

A pretest for need tests the terminal objective(s) of the instruction to follow. It can 
serve three purposes: 

• Placement in the instruction, to determine that the learner does not already 
know what is to be taught. 

• Routing around the instruction, in individualized instruction, by allowing 
learners to skip instruction on what they already know. 

• Evaluation of effectiveness of the instruction, where there is a need to 
demonstrate learning gain. In this type of evaluation, this kind of pretest 
provides the beginning level of understanding of what is to be taught, and 
it can detect problems with learners who know too much or too little to be 
representative of those who are intended to be in the study. 

The closer to the beginning of the instruction this test is administered, the better
the measurement will be of learning at the beginning of the instruction. 

Like the lesson quiz, it is important for these tests to be a highly valid measure of 
what is taught in the instruction. If the instruction is in turn defined by standards, 
then the pretest for need will be a valid measure of the corresponding benchmark 
or objective. 

When these tests are used for placement or routing, they are low-stakes, since the 
only consequence of measurement error would be to waste a small amount of the 
learner’s time (by teaching or skipping a lesson inappropriately). Therefore, 
reliability can be moderate, and the test cost and test length can be modest. 

When pretests of need are used to provide baseline data for evaluation of 
instruction, it may be appropriate to use a high-stakes test in order to get the 
superior reliability offered - if one is available which corresponds closely to the 
content taught. In standards-referenced curricula, this usually rules out use of 
norm-referenced tests; a state standards test (which is competency based) may be 
an appropriate measure, if administration is at the right time (such as the beginning 
of a semester). 

These tests usually vary in length, depending on their purpose. Pretests of need 
used for routing are often very short and are given immediately before each lesson. 
They can simply be an alternate form of the lesson quiz, but with different 
questions. 
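
The routing use of a pretest for need can be sketched as follows. This is a minimal illustration and not the logic of any PLATO product: the lesson names, the function name, and the 0.8 mastery cutoff are assumptions.

```python
def route_learner(pretest_scores, lesson_order, mastery_cutoff=0.8):
    """Return the lessons a learner still needs, skipping those already mastered.

    pretest_scores maps lesson name -> fraction correct on that lesson's pretest.
    A lesson with no pretest score is assigned, since mastery was not shown.
    The 0.8 cutoff is an illustrative assumption.
    """
    return [
        lesson
        for lesson in lesson_order
        if pretest_scores.get(lesson, 0.0) < mastery_cutoff
    ]


lessons = ["place value", "rounding", "estimation"]
scores = {"place value": 0.9, "rounding": 0.5}  # hypothetical pretest results
print(route_learner(scores, lessons))  # ['rounding', 'estimation']
```

Because a wrong routing decision costs the learner only one lesson's worth of time, a short pretest with a simple cutoff is usually adequate for this purpose.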

Pretests of need used for placement and evaluation tend to be somewhat longer 
because they often cover more than one lesson’s content. They can simply chain 
together a number of lesson quizzes (with questions different from those used in 
the actual lesson quizzes), or they can be specially written for the purpose. 

To minimize testing time, for many curricula the PLATO® system offers the 
FASTRACK tailored testing system as a pretest of need used for placement. It is 
described more fully in Part 4 of this paper. 

Pretest for readiness 

A pretest for readiness tests the mastery of prerequisite knowledge of the 
instruction to follow. It can serve three purposes: 

• Admission to the instruction, to show that the learner has mastered 
knowledge and skills assumed by the instruction to come. 

• Diagnosis of deficits in prerequisites, for assignment to individualized 
instruction. 

• Evaluation of instruction, if the evaluation plan requires a check to be sure
that learners are ready for the instruction. Typically, this kind of test is 
done in combination with a pretest of need. 

The closer to the beginning of the instruction this test is administered, the better
the measurement will be of the learner’s state at the beginning of the instruction.

All instruction assumes a starting point of some prior knowledge on the part of
learners. It is important for pretests of readiness to be a valid measure of the
assumptions made by the instruction to come. If the instruction is in turn
defined by standards, then the pretest for readiness should be a valid measure of the
corresponding benchmark(s) or objective(s) which are prerequisite to what is
about to be taught. This test content is very different from that of the pretest of
need, discussed above.

When readiness tests are used diagnostically, they are low-stakes, since the only 
consequence of measurement error would be to waste a small amount of the 
learner’s time (by teaching or skipping a lesson inappropriately). Therefore, 
reliability can be moderate, and the test cost and test length can be modest. 

When pretests of readiness are used for admission, they may be high-stakes (as in 
college admissions). In this case, test reliability must be high, and the cost and 
length of development and delivery can justifiably be high. However, if the 
admissions decision is relatively low-stakes (such as admission to an enriched or 
advanced placement program), it may be appropriate to use moderate-reliability 
tests with their shorter length and lower cost. 

When pretests of readiness are used for evaluation of instruction, the purpose is to 
verify that all learners are ready for the instruction to follow. This is an important 
evaluation consideration, because research such as Bloom’s (Bloom 1982) has 
shown that approximately half of the variation in achievement found at the end of 
instruction is related to variation in achievement at the beginning of instruction. In 
most cases, however, this is a low-stakes use and the test can be relatively short 
and low cost. 

Cumulative (course and unit) tests 

The purpose of cumulative tests administered at the end of a unit or course is to 
measure longer-term retention of what was taught, and (if course content builds on 
itself) to measure the learner’s ability to integrate the content of the whole unit or 
course. 

Cumulative tests given at the end of units and courses typically sample the 
terminal objectives of the lessons in the unit or course, because an exhaustive test 
would be too long. If the content builds on itself (so that, for example, a learner 
who has mastered lesson 5 must also have mastered lessons 1-4), then it is 
appropriate for the cumulative test to concentrate on the higher level (and later) 
content in the unit or course. If the content does not build on itself, then the
cumulative test should randomly sample all the content taught. 
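
The two sampling rules just described can be sketched in Python. The function name, the lesson labels, and the choice to take exactly the last objectives when content is cumulative are assumptions made for illustration.

```python
import random


def build_cumulative_test(objectives, n_items, builds_on_itself, seed=0):
    """Pick which lesson objectives a cumulative test will sample.

    `objectives` is listed in teaching order (earliest first).
    """
    if not 1 <= n_items <= len(objectives):
        raise ValueError("n_items must be between 1 and the number of objectives")
    if builds_on_itself:
        # Later objectives presuppose earlier ones, so test the last n_items.
        return objectives[-n_items:]
    # Independent content: sample at random across everything taught.
    rng = random.Random(seed)
    return rng.sample(objectives, n_items)


unit = [f"lesson {i}" for i in range(1, 7)]  # hypothetical six-lesson unit
print(build_cumulative_test(unit, 2, builds_on_itself=True))
# ['lesson 5', 'lesson 6']
print(build_cumulative_test(unit, 3, builds_on_itself=False))
```

In either case the cumulative test covers only part of what was taught, which is one reason the lesson-level tests deserve more weight than they often receive.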

Instructors often give more weight to unit and course tests than they do to lesson 
quizzes and mastery tests, but this may not be appropriate. Often lesson-level tests 
have many more questions per objective, and taken together they test course 
content more thoroughly than usually is possible with a cumulative test. 

Validity of cumulative tests should be judged by how closely tied they are to the 
course objectives - which in turn should be tied closely to appropriate standards 
and benchmarks. The tests should therefore be competency-based and criterion- 
referenced. 

Cumulative tests are generally low-stakes (since the usual consequence of a 
measurement error is a grade which is too low or too high), so reliability can be 
modest. This permits test length and cost to be relatively low. 

Certification & Standardized Tests 

State competency tests and other tests which certify attainment of standards or a 
defined level of achievement also are given at the end of a block of instruction. 
They typically are used for high-stakes purposes, such as promotion, graduation, 
or employment. With accountability now national policy, they also are used to 
demonstrate quality of the educational system, and the jobs of school 
administrators and teachers may depend on these tests - a high-stakes use, if ever 
there was one! 

Validity of these tests should be determined by demonstrating how closely they 
are tied to standards. The tests should be competency-based and criterion-
referenced. Norm-referenced tests (also called standardized tests; see footnote 1) are defined by
a domain specification which is not based on standards, so they should not be used 
to certify mastery of standards. 

Note, however, that practicalities of test length limit these tests to a domain
sampling strategy. Thus they are not suitable for detailed instructional
management decisions because they do not test every detail of a standard or
benchmark. The information they provide is at a level of detail suitable only for
general decisions, not detailed diagnostic prescriptions.

Other factors limiting the utility of these tests are how infrequently they are given,
the lack of detail in reports, and delays in receiving reports - often months after test
administration. These factors make the test results nearly useless for making
real-time instructional management decisions.



1 It is fairly common, but technically incorrect, to refer to all high-stakes tests as “standardized.” The process of
standardization, also called norming, applies only to norm-referenced tests.

Note that the importance of these tests has created the need for similar “practice”
tests which can be administered on demand and scored immediately with detailed 
prescriptive reports. 

Reliability of certification and standards tests should be high, with the associated 
requirements for relatively long tests and high costs of development. However, 
there has been considerable criticism of the validity and reliability of many state 
standards tests (Marzano and Kendall 1996). Use of low-reliability, low-validity 
tests for decisions with major consequences has resulted in successful litigation in 
other circumstances, and the same may happen with the state tests. 



Testing and No Child Left Behind 



The recent Federal No Child Left Behind Act (NCLBA) raises the stakes for 
educators, and makes effective implementation of standards-aligned curricula even 
more critical. The Federal legislation has established these goals for the states 
which relate to tests: 

• Development of competency-based tests which align to standards 

• Near-mandatory participation in the National Assessment of 
Educational Progress (NAEP), with the expectation that state tests 
will be consistent with the NAEP tests. 

• Annual testing at every grade level 

• Over a period of years, expansion of standards and tests to include 
science. 

• Disaggregated reporting of progress, by subgroups of students 
identified by race, ethnicity, handicap, and economic status. 

The danger is that educators may treat these new testing requirements as 
additional “layers” which only take time away from the curriculum, as 
discussed above. The intent of the law, however, is to drive standards into the 
everyday practice of every classroom at every grade level. 

This makes it even more important for schools to assure that all their daily 
testing practice is aligned to standards and of adequate quality to allow 
effective data-based decision-making. Since teachers vary widely in their test 
writing skills, this is a major challenge. Furthermore, aligning the tens of 
thousands of test items in use in a typical school district to standards is a 
daunting task. 

Automated systems, such as PLATO® Orion GATE and PLATO® Link, can 
help. By providing banks of tens of thousands of pre-aligned, carefully 
reviewed and quality-controlled test items, these systems give teachers a 
flexible, customizable and powerful tool to make standards-aligned testing a 
reality in everyday classroom use. 

Automated testing has additional important advantages to teachers and 
administrators. For teachers, the systems make it possible to administer 
individualized tests for each learner, thus supporting self-pacing and 
individualization. In addition, test scoring and record-keeping are entirely 
automated - a huge time saver. For administrators, the systems provide real- 
time, daily updates on progress toward standards, for the entire district or 
broken down by school, grade, classroom, or subgroup of students. This 
makes it possible to spot areas in need of assistance immediately - thus 
avoiding nasty surprises when annual testing time comes around. 
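
As a rough illustration of the kind of roll-up described above, the following Python sketch computes the share of mastered objectives for each reporting group. The record layout, the field names (school, grade, mastered) and the sample data are assumptions made for this example; they do not describe the actual PLATO report formats.

```python
from collections import defaultdict


def mastery_rate_by_group(records, group_key):
    """Return {group: fraction of records marked as mastered}.

    `records` is a list of dicts, one per student per objective.
    `group_key` names the (hypothetical) field to break the report
    down by, for example "school" or "grade".
    """
    mastered = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        if rec["mastered"]:
            mastered[group] += 1
    return {group: mastered[group] / total[group] for group in total}


# Illustrative data only.
sample_records = [
    {"school": "North", "grade": 4, "mastered": True},
    {"school": "North", "grade": 4, "mastered": False},
    {"school": "North", "grade": 5, "mastered": True},
    {"school": "South", "grade": 4, "mastered": True},
]

by_school = mastery_rate_by_group(sample_records, "school")
by_grade = mastery_rate_by_group(sample_records, "grade")
```

The same function serves a district-wide view or a subgroup view simply by changing the grouping field, which is the essence of disaggregated reporting.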

A further implication concerns implementation of mastery learning. 
NCLBA’s requirement that 100% of students meet appropriate standards 
places considerable emphasis on programs of instruction which are flexible 
enough to adapt to the needs of every individual student. Doing so requires 
teachers to make frequent - even daily - decisions about what each individual 
student should be doing: literally hundreds of instructional management 
decisions each day. Furthermore, the decisions must be made based on data 
which shows what every individual student has learned, and where problems 
in understanding may exist. Simply collecting this data, marking the papers, 
and maintaining the needed records is a task of which only superhuman 
teachers are capable, if it is done manually. This is why paper-based mastery 
learning systems almost inevitably collapse under their own weight. 

Automating testing, marking and reporting makes mastery learning possible. 
Furthermore, the PLATO courseware makes possible the complete 
individualization of pacing which is essential to mastery learning. Taken 
together, these technologies free the teacher to move to a “guide on the side” 
role in an effective mastery learning system. 



Reliability and Validity Considerations for Standards-Referenced Tests 



In the standards-referenced world, all tests - pre-, during, and post-instruction -- 
should be competency-based and criterion-referenced. While we might wish all 
tests to be perfectly valid and reliable, issues of test cost and length require instead 
that educators balance the requirement for validity and reliability against the 
purpose of the test. This is the significance of the distinction between “high 
stakes” and “low stakes” used above. 

Test quality: high-stakes and low-stakes tests 

Tests are used to make decisions. If the decision has major consequences, such as 
admission to a school or educational program, graduation, employment or 
promotion, then stakes are high: the consequences of an incorrect decision are 
important and lasting. If the decision has minor consequences, such as placement 
within a curriculum, control of progress through a curriculum, or cumulative 
testing for a course grade, then stakes are low: the consequences of an incorrect 
decision are usually limited to some wasted time. For example, if learners are 
placed at too low a level in a curriculum, they might be assigned some unneeded 
study. 

The distinction is important for a number of reasons. Recommendations of 
standard professional practice in testing (American Educational Research 
Association, American Psychological Association, et al. 1999) suggest that tests 
of only the highest quality (and cost and length) be used for high-stakes tests, and 
this recommendation has frequently been supported in civil court cases. Low- 
stakes tests, on the other hand, are not held to such high standards of quality, cost 
and length. 

Test quality is expressed in terms of validity and reliability. Both concepts guide 
the interpretation of the data generated by the test, rather than the appearance of 
the test items themselves. Tests of high quality (high stakes tests) have higher 
validity and reliability, because the tests have been developed using a great deal of 
research as well as technical care. This is what makes high quality tests costly to 
develop, and often relatively lengthy. By contrast, a moderate-quality test (low- 
stakes test) rarely is developed with a similar level of research, and development 
cost is considerably lower. 

Validity 

As discussed above, in the standards-referenced world validity is determined by 
how well a test corresponds to standards. There is no simple statistical procedure 
to prove validity: 

There is an implicit two-step rationale: First, relevant knowledge and skill 
important to domain performance are delineated by means, for example, of job 
or task analysis or other sources of domain theory; second, construct-valid 
measures of the important knowledge and skill are selected or developed. Test 
items and tasks are deemed domain relevant because they are presumably 
construct-valid measures of relevant domain knowledge and skill or, more 
generally, of relevant domain processes and attributes (Linn, National Council 
on Measurement in Education, et al. 1993). 

Therefore, it is important to establish review procedures by which trained panels 
of experts compare items to their standards and judge if they correspond in 
content, Taxonomy level, and difficulty. 

Tests also are sometimes shown to have bias by favoring certain genders, 
ethnicities, economic groups and student populations. These are questions of 
validity: a perfect test would measure only the knowledge it is designed to, and all 
test takers who know that knowledge would perform the same way on the test. 
Precautions during test writing and editing can minimize these sources of validity 
issues, and test item reviewers should be trained to address these issues. 

Sometimes research projects are used to establish statistically the validity of a test. 
However, these methods for establishing validity generally are appropriate only 
for high-stakes tests. For example, to guard against bias in high stakes tests it may 
be worthwhile to compare scores on the test among equivalent student populations 
of various profiles, in order to detect differences. An advanced statistical 
technique for scoring tests, called item characteristic curve modeling, is possible 
only on high quality tests, but has yet to be applied to state standards tests (though 
the next generation of tests may do so). 

The most common way to improve validity of a norm-referenced test is to give it 
to different groups of students and compare the results to see if the test mirrors the 
actual differences between the students, as measured by some other means. For a 
competency-based test, this method is less useful; instead, the best way to improve 
validity is to see how well the test questions map back to their corresponding 
standards and benchmarks. This is usually done with expert raters. 

Reliability 

Reliability is the degree to which a test works in a predictable way. In principle, if 
you give a test to a particular learner twice, a perfectly reliable test would always 
produce the same score 2 . Thus, a valid test also has to have good reliability 
(though it is possible to have a reliable test which is not valid). 
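
To make the idea concrete, here is a minimal Python sketch of two common ways to quantify consistency across two administrations: the Pearson correlation between the two sets of scores (the classical approach), and the proportion of learners who receive the same mastery decision both times (often more relevant for competency-based tests). The function names, the sample scores and the cut score of 75 are assumptions for illustration only, not procedures described in this paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def decision_consistency(first, second, cut_score):
    """Fraction of learners classified the same way (master or
    non-master) on both administrations."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equal-length, non-empty lists")
    same = sum((a >= cut_score) == (b >= cut_score)
               for a, b in zip(first, second))
    return same / len(first)


# Illustrative percent-correct scores for four learners on two forms.
form_a = [80, 60, 90, 50]
form_b = [85, 65, 70, 55]

r = pearson_r(form_a, form_b)
agreement = decision_consistency(form_a, form_b, cut_score=75)
```

In this made-up data, three of the four learners receive the same mastery decision on both forms, so the agreement is 0.75 even though the raw scores differ on every learner.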

Procedures for establishing reliability of competency-based tests differ from those 
for norm-referenced tests. Well-recognized principles for improving the quality of 
test items, found in standard measurement texts, apply to tests of both types, and a 
rigorous review and editorial process should apply to all test items as a means of 
improving reliability. Statistical techniques, such as item analysis, take on a 
unique form for competency-based tests, however, and are commonly applied 
only to “high-stakes” tests. Using item analysis, “high stakes” tests discard weak 
items (often as many as 9 items will be discarded for every one used). When tests 
are used to make “high stakes” decisions such as graduation and admission, it is 
worthwhile to go to considerable expense to assure the reliability and validity of 
the test. When the purpose is “low stakes,” the tests can be shorter and of lower 
reliability and validity. It is important not to make “high stakes” decisions with 
“low stakes” tests. However, it is impractical and too costly to make “low stakes” 
decisions with “high stakes” tests. 
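
The following Python sketch shows one simple form that item analysis can take for a competency-based test. For each item it computes the difficulty (proportion of learners answering correctly) and a discrimination index defined as the difference between the proportion correct among learners judged to be masters and among non-masters. This index, the 0.2 threshold and the sample data are assumptions chosen for illustration; the paper does not specify which statistics are used.

```python
def item_statistics(responses, is_master):
    """Return a list of (difficulty, discrimination) tuples, one per item.

    responses: one list of 0/1 item scores per learner.
    is_master: parallel list of booleans from an external mastery judgment.
    """
    masters = [r for r, m in zip(responses, is_master) if m]
    non_masters = [r for r, m in zip(responses, is_master) if not m]
    if not masters or not non_masters:
        raise ValueError("need at least one master and one non-master")
    stats = []
    for i in range(len(responses[0])):
        difficulty = sum(r[i] for r in responses) / len(responses)
        p_masters = sum(r[i] for r in masters) / len(masters)
        p_non_masters = sum(r[i] for r in non_masters) / len(non_masters)
        stats.append((difficulty, p_masters - p_non_masters))
    return stats


def weak_items(stats, min_discrimination=0.2):
    """Indices of items whose discrimination falls below the threshold
    (the default threshold is an illustrative assumption)."""
    return [i for i, (_, disc) in enumerate(stats) if disc < min_discrimination]


# Illustrative data: four learners, two items.
sample_responses = [[1, 1], [1, 0], [0, 1], [0, 1]]
sample_mastery = [True, True, False, False]

sample_stats = item_statistics(sample_responses, sample_mastery)
flagged = weak_items(sample_stats)
```

Here the second item is flagged because non-masters answer it correctly more often than masters do, which is the kind of weak item that would be discarded.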



2 In practice, a learner learns a bit just by taking a test, so a re-administration of the exact same test would produce 
a higher score. To prevent this problem, it’s common to use two parallel forms of the same test. Comparing scores 
across two administrations is called test-retest reliability (or parallel-forms reliability when two forms are used). 

Note that it is not true that some question formats are inherently more reliable or 
valid than others. For example, teachers often mistrust multiple-choice questions 
- perhaps because they have seen (and written) so many that lack reliability and 
validity. But other question formats, such as essay questions, can have equally 
bad problems with reliability and validity unless they are carefully designed and 
scored. The same applies to performance-based measures (such as projects): the 
apparent realism of the task (face validity) often masks significant reliability 
problems in scoring the work products, and the poor reliability in turn limits 
validity. 



Fig. 3, below, summarizes the tradeoffs between test purpose, length and cost, and 
reliability. 



Purpose (Stakes)    Length/Cost    Reliability 

Low                 Short/Low      Low 

Moderate            Long/Low       Moderate 

High                Long/High      High 

Fig. 3: Tradeoffs between test purpose, cost, and reliability 

Accountability has placed new emphasis on the need for tests which accurately 
mirror the standards and benchmarks (and thus are valid and competency-based), 
and which are of appropriate reliability. For high stakes purposes such as state 
standards tests, reliability (and cost and length) should be high. For low stakes 
purposes such as placement, progress and end-of-unit tests used in the course of 
instruction, reliability (and cost and length) should be moderate. Teacher-written 
tests, because of their often poor reliability and validity, may be unwise even for 
low-stakes purposes. 
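
The link between test length and reliability is often approximated with the Spearman-Brown formula from classical test theory. It is offered here as a general illustration, not as a procedure described in this paper, and it assumes that the added items are of the same quality as the existing ones. If a test with reliability \rho is lengthened by a factor of k, the predicted reliability \rho_k is:

```latex
\[
  \rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}
\]
% Worked example: doubling (k = 2) a test with reliability 0.60
\[
  \rho_2 = \frac{2 \times 0.60}{1 + (2 - 1) \times 0.60}
         = \frac{1.20}{1.60} = 0.75
\]
```

Under these assumptions, doubling the length raises reliability from 0.60 to 0.75, and tripling it raises reliability only to about 0.82, which is why high reliability tends to require relatively long tests.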



Reliability and Validity Procedures Used for PLATO Tests 



All PLATO® test systems have been developed using procedures and design 
standards that follow accepted professional practices for low-stakes competency- 
based, criterion-referenced tests. 

Validity 



For competency-based tests, the critical validity issues concern alignment of test 
items to objectives, benchmarks and standards (content validity), as well as 
procedures for domain sampling. PLATO Learning makes no claims of predictive 
validity for any of its tests. Allowing for variations due to requirements for each 
test type, these general steps are followed in development of all PLATO tests: 

1. A detailed content map is developed, by use of task analysis procedures 
(for courseware) and detailed analysis of state and national curriculum 
standards and benchmarks (for all products). This analysis is typically 
much more exhaustive than the domain specification which suffices for 
norm-referenced tests, and takes into account issues of knowledge type 
and structure and Taxonomy level as well as topic. 

2. A test specification is developed, with model items corresponding to each 
objective to be tested, item specifications such as difficulty level (cognitive 
load), reading level, and typical errors to capture, as well as number of 
items to be developed per objective. This is in place of the domain sample 
specification used in norm-referenced tests. 

3. The test specification document is then reviewed by subject matter experts 
and instructional design specialists. Criteria for the review include content 
accuracy and distribution of items, item writing style, correspondence to 
objectives and standards in content and Taxonomy level, and technical 
feasibility. 

4. Where model items have been published to correspond to the target 
curriculum standards and tests, or where research has demonstrated that a 
particular item format is preferable, the PLATO® items follow these 
recommendations. Item formats vary according to the system. Most 
common are 3- and 4-choice multiple-choice formats. PLATO 
courseware also includes a variety of constructed-response formats. 
Binary response (true/false, etc.) formats are avoided unless the reference 
standard uses them, because of their inherently low reliability. 

Some practice tests also emulate the exact physical format of the target 
high-stakes test. This is called full idiom coverage. 

Also of note is the ability of the National Writing Practice Test to grade 
responses to essay questions for content. This system was developed in 
partnership with Educational Testing Service. It uses the eRater essay test 
grading system, a highly valid and reliable technology used for the essay 
components of many ETS high-stakes tests. 

5. Guidelines are also applied in item writing and review to minimize bias by 
age, gender, economic status, race and ethnicity. For example, care is 
taken to use vocabulary, figures of speech and contexts which are widely 
identified and understood, and to avoid cultural references which are likely 
to be too specific, such as references to holidays, sports, colloquial and 
regional expressions, unhealthy foods, gender or ethnic stereotypes, and 
the like. 

6. All items are reviewed for content accuracy by an independent subject 
matter expert (SME). 

7. All items are peer reviewed for clarity and style, as well as sound 
instructional design. 

Reliability 



Additional design guidelines and development practices help to assure reliability. 

8. All items are peer-reviewed and edited for use of recommended item 
writing practices appropriate to that item type. Guidelines used are drawn 
from standard psychometric references, such as Gronlund (1993), 
Haladyna (1997), and Osterlind (1998). 

9. Item formats are designed to minimize extraneous difficulty in item 
comprehension and answer entry due to limitations of the user interface. 
Thus, for example, keyboarding skills needed are minimal for most item 
formats, and care is taken to keep all information needed on screen at 
once, wherever possible. 

10. All items undergo technical review and testing to verify that screen 
displays, user interface and answer analysis work correctly. 

11. For pre-defined tests, additional quality controls are applied in a final 
review. Each test is taken with all questions answered incorrectly. The 
process is repeated with all questions answered correctly. Each question is 
given an in-depth check, including evaluation of formatting and 
appearance. The following questions are asked: 

• Are there enough questions to test for mastery of a skill? 

• Are the test items age- and grade-level appropriate? 

• Does each skill appear in the listed objectives and does each 
objective appear on the test? 

• Is each question answerable (not vague, subjective, or 
incomplete)? 

The test is taken again, answering the questions correctly. Each possible 
answer is checked in-depth. A time-on-task analysis is performed on the 
test, to determine whether a student could take it in less than 45 minutes. 
The following questions are asked: 

• Is there a correct answer for each question? 

• Do any answers appear twice? 

Each test is generated repeatedly to check all possible combinations of 
questions and answers. Any tests that fail to pass all criteria are edited and 
put through the same rigorous process again. 
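
Some of the checks listed in step 11 lend themselves to automation. The sketch below, in Python, flags an item that has duplicate answer choices, no correct answer among its choices, or empty question text. The item structure (a dict with "stem", "choices" and "answer" keys) is a hypothetical layout chosen for this example and is not the PLATO item format.

```python
def check_item(item):
    """Return a list of problems found in one multiple-choice item.

    `item` is a dict with hypothetical keys:
      "stem"    - the question text
      "choices" - list of answer choices shown to the learner
      "answer"  - the choice that is scored as correct
    An empty list means the item passed these checks.
    """
    problems = []
    choices = item.get("choices", [])

    # Do any answers appear twice? Compare ignoring case and spacing.
    normalized = [choice.strip().lower() for choice in choices]
    if len(set(normalized)) != len(normalized):
        problems.append("duplicate answer choices")

    # Is there a correct answer for the question?
    if item.get("answer") not in choices:
        problems.append("no correct answer among choices")

    # Is the question answerable at all?
    if not item.get("stem", "").strip():
        problems.append("empty question text")

    return problems


good_item = {"stem": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"}
bad_item = {"stem": "2 + 2 = ?", "choices": ["4", " 4", "5"], "answer": "6"}

good_result = check_item(good_item)
bad_result = check_item(bad_item)
```

Checks of this kind can catch mechanical defects only; judgments such as age appropriateness or vagueness still require human review.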

To minimize item exposure in re-testing, all testing systems use random selection 
of items from a defined pool of items (exceptions are certain reading 
comprehension tests, which randomly select passages accompanied by clusters of 
items, and certain practice tests which are fixed format). The PLATO® Link 
system adds a most-recent-use algorithm to further minimize item exposure. 
Predefined tests select between 3 and 5 items per objective; user-defined tests may 
select any number of items per objective. Refer to the next sections for 
specifications of item pool size by test. 
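
The selection strategy described above can be sketched in Python as follows. The paper does not describe how the most-recent-use algorithm works, so this is only one plausible interpretation: items are shuffled for randomness, then ordered so that those used least recently (or never) are chosen first. The function name, the usage counter and the sample pool are all assumptions for illustration.

```python
import random


def select_items(pool, last_used, n_items, rng=None):
    """Pick `n_items` from `pool`, preferring items used least recently.

    pool:      list of item ids for one objective.
    last_used: dict mapping item id to a counter value for its most
               recent use; items missing from the dict were never used.
    rng:       optional random.Random instance, for repeatable tests.
    """
    if n_items > len(pool):
        raise ValueError("not enough items in the pool")
    if rng is None:
        rng = random.Random()
    candidates = list(pool)
    rng.shuffle(candidates)  # random order among equally recent items
    # Stable sort: never-used items (-1) first, most recently used last.
    candidates.sort(key=lambda item_id: last_used.get(item_id, -1))
    return candidates[:n_items]


# Illustrative pool for one objective; "a" and "b" were used recently.
sample_pool = ["a", "b", "c", "d", "e"]
sample_usage = {"a": 5, "b": 4}

first_pick = select_items(sample_pool, sample_usage, 3, random.Random(0))
second_pick = select_items(sample_pool, sample_usage, 4, random.Random(0))
```

With this data, a three-item test draws only from the unused items, and a four-item test adds the item that was used longer ago before reusing the most recent one.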

At this time, statistical analysis of item reliability is performed only on the 
PLATO® Link item bank. Test reliability statistics are not available. 

Part 3 of this paper summarizes and compares the two general-purpose testing 
system options available to PLATO users: PLATO® Orion GATE, and PLATO® 
Link. Part 4 summarizes and compares the testing systems which are embedded in 
PLATO courseware and designed to support its use. 

Part 3 




PLATO Learning Comprehensive Testing Systems 

PLATO Learning offers two comprehensive testing systems for reading, writing 
and mathematics curricula. These systems provide a convenient way to select or 
construct standards-referenced low-stakes tests, to mark them and to report results. 
They are linked to instructional resources (including PLATO courseware) by 
reference. The Grading and Testing Engine (GATE) is part of the Orion 
curriculum planning and management system, and is an ideal tool for Orion users. 
For clients who do not use the Orion system, or who need a larger item pool, 
PLATO Learning offers PLATO® LINK. 

PLATO Learning also offers the Simulated Test System (STS): a series of practice 
tests which emulate particular high-stakes tests in content and item format and 
which provide prescriptions directly to PLATO courseware. The K-12 simulated 
tests are delivered through the PLATO® LINK system. Other simulated tests (of 
GED 2002, Pre-Professional Skills Test, and the National Writing Test) do not 
require a PLATO® LINK subscription and do not provide access to other PLATO® 
LINK facilities. 

The National Writing Test was developed in partnership with Educational Testing 
Service. It uses their highly-regarded eRater essay question scoring technology - 
the same technology used to score essays in ETS’ high-stakes tests. The National 
Writing Test provides essay questions typical of those included on state 
competency tests. The full text of each learner’s essay is scored for content as 
well as mechanics. The learner instantly receives scoring information and model 
essays for comparison. 

Both PLATO® Orion GATE and PLATO® LINK address a wide variety of key 
needs for internal, low-stakes testing, including: 

• Transparency: The systems are understandable and clear to students, 
parents, and educators. They provide teachers and students with timely, 
effective feedback to facilitate progress toward meeting standards, and 
ensure that teachers of all grade levels work as a team to meet standards 
responsibly and responsively. Parents receive user-friendly and helpful 
information about student performance. 

• Practicality: Tests are easy to compile and administer, and students and 
teachers find the system easy to use. The tests are minimally intrusive, 
since they can provide frequent, short tests rather than time-consuming, 
one-time tests, and they provide unlimited access for all parties. They are 